Goal Misgeneralization
13 pages tagged "Goal Misgeneralization"
What are "mesa-optimizers"?
Can we test an AI to make sure it won't misbehave if it becomes superintelligent?
What is the difference between inner and outer alignment?
What is David Krueger working on?
What is Aligned AI / Stuart Armstrong working on?
How might interpretability be helpful?
What is perverse instantiation?
How is red teaming used in AI alignment?
What is inner alignment?
What is "Constitutional AI"?
What is adversarial training?
What is the "sharp left turn"?
But won't we just design AI to be helpful?